PCI-DMA/CPU Handoff for Increased Effectiveness of Checkpointing Functionalities in CCL

Authors

  • Andrea Santoro
  • Francesco Quaglia
Abstract

Checkpointing and Communication Library (CCL) is a recently developed software layer that supports optimistic parallel discrete event simulation on Myrinet clusters. Beyond low-latency message delivery, CCL offers non-blocking checkpointing supported by a programmable PCI DMA engine on board Myrinet cards. CCL re-synchronizes PCI DMA activities with CPU activities to keep checkpointed information consistent (i.e. to prevent the CPU from updating information that still needs to be copied by the DMA). If re-synchronization is invoked before the checkpoint operation has completed, the simulation activities carried out by the CPU may be forced to wait for checkpoint completion. Since copying data through the PCI DMA is slower than copying it with the CPU, in pathological situations a re-synchronization period may last longer than a whole checkpoint operation performed by the CPU, thus nullifying the potential benefit of offloading checkpointing from the CPU. This paper tackles the issue by presenting the design and implementation of a mechanism that hands off checkpoint operations between the PCI DMA and the CPU, enhancing the effectiveness of the checkpointing functionalities offered by CCL. A checkpoint operation is initially entrusted to the PCI DMA, but whenever re-synchronization forces the simulation application to wait for its completion, the operation is dynamically switched to the CPU, the fastest available device, since its timely completion has become a performance-critical task for the simulation application.
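The trade-off described in the abstract can be illustrated with a small timing model. The following Python sketch is not CCL code: the `Checkpoint` class, the function names, and the copy rates are all hypothetical, and it assumes constant copy rates and that the CPU simply copies whatever bytes the DMA has not yet transferred when re-synchronization is invoked.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical model of one checkpoint operation (not a CCL type)."""
    size: float      # bytes to copy
    dma_rate: float  # bytes per time unit via PCI DMA (assumed constant)
    cpu_rate: float  # bytes per time unit via CPU copy (assumed constant)


def wait_without_handoff(cp: Checkpoint, resync_time: float) -> float:
    """CPU stall at re-synchronization when the DMA must finish alone."""
    dma_finish = cp.size / cp.dma_rate
    return max(0.0, dma_finish - resync_time)


def wait_with_handoff(cp: Checkpoint, resync_time: float) -> float:
    """CPU stall when the CPU takes over the bytes the DMA has not copied."""
    copied = min(cp.size, cp.dma_rate * resync_time)
    remaining = cp.size - copied
    return remaining / cp.cpu_rate


def cpu_only_checkpoint(cp: Checkpoint) -> float:
    """Cost of performing the whole checkpoint on the CPU."""
    return cp.size / cp.cpu_rate


# Illustrative numbers only: the DMA is assumed four times slower than the CPU.
cp = Checkpoint(size=1000.0, dma_rate=100.0, cpu_rate=400.0)
early_resync = 2.0  # re-synchronization invoked shortly after the checkpoint starts

stall_dma_only = wait_without_handoff(cp, early_resync)
stall_handoff = wait_with_handoff(cp, early_resync)
full_cpu_cost = cpu_only_checkpoint(cp)
```

With these assumed numbers, waiting for the DMA alone stalls the CPU for longer than a full CPU-side checkpoint would have taken, which is the pathological case the abstract mentions, while handing the remaining bytes to the CPU keeps the stall below that cost.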


Related resources

Tuning of the Checkpointing and Communication Library for Optimistic Simulation on Myrinet Based NOWs

Recently a Checkpointing and Communication Library (CCL) for optimistic simulation on Myrinet based Network of Workstations (NOWs) has been presented. CCL offloads checkpoint operations from the CPU by charging them to a programmable DMA engine on the Myrinet network card. CCL also includes functionalities for freezing the simulation application on demand, which can be used for data consistency ...


Multiprogrammed non-blocking checkpoints in support of optimistic simulation on myrinet clusters

CCL (Checkpointing and Communication Library) is a software layer in support of optimistic Parallel Discrete Event Simulation (PDES) on Myrinet-based COTS clusters. Beyond classical low-latency message delivery functionalities, this library implements CPU-offloaded, non-blocking (asynchronous) checkpointing functionalities based on data transfer capabilities provided by a programmable DMA engin...


Benefits from Semi-asynchronous Checkpointing for Time Warp Simulations of a Large State Pcs Model

Checkpointing overhead is a major obstacle for the effectiveness of Time Warp parallel discrete event simulators. Semi-asynchronous checkpointing is a recent solution to tackle this obstacle for Time Warp simulations on distributed memory systems based on Myrinet. In this solution, checkpoint operations are offloaded from the host CPU and are charged to a DMA engine on board of Myrinet network ...


A Study of Disk Performance Optimization

A STUDY OF DISK PERFORMANCE OPTIMIZATION by Richard S. Gray Response time is one of the most important performance measures associated with a typical multi-user system. Response time, in turn, is bounded by the performance of the input/output (I/O) subsystem. Other than the end user and some external peripherals, the slowest component of the I/O subsystem is the disk drive. One standard strateg...


Performance and Effectiveness Analysis of Checkpointing in Mobile Environments

Many mathematical models have been proposed to evaluate the execution performance of an application with and without checkpointing in the presence of failures. They assume that the total program execution time without failure is known in advance, under which condition the optimal checkpointing interval can be determined. In mobile environments, application components are distributed and tasks a...



Publication date: 2003